Prepared & taught by Andrew Fair, for NYULH Data Day-to-Day (Feb 15, 2022)
Associate Director of Research Environment, Dept of Population Health
My background is in public health & informatics. I've taken a few GIS & web mapping classes.
andrew.fair@nyulangone.org
A Geographic Information System (GIS) is a computer system that analyzes and displays geographically referenced information. It uses data that is attached to a unique location.
https://www.usgs.gov/faqs/what-geographic-information-system-gis
Some examples of how GIS can be used in public health:
GIS is much more than mapping! Most of our lesson today won't involve maps.
But let's be honest, maps are fun!
It's true that GIS generally entails geographically referenced data. But there is a lot you can do without ever running a geocoding program, much of which doesn't involve addresses or latitude/longitude points at all.
ArcGIS is a product sold by a company (ESRI).
This isn't like Kleenex and tissues or Google and, well, "google."
There are lots of other software options for GIS, many of which are free and open source.
If you want a GUI like ArcGIS, QGIS is one such option.
Today, we'll be using free, open-source code-based examples using Python.
Map projections try to portray the surface of the earth, or a portion of the earth, on a flat piece of paper or computer screen. In layman’s term, map projections try to transform the earth from its spherical shape (3D) to a planar shape (2D).
A coordinate reference system (CRS) then defines how the two-dimensional, projected map in your GIS relates to real places on the earth. The decision of which map projection and CRS to use depends on the regional extent of the area you want to work in, on the analysis you want to do, and often on the availability of data.
Key points:
https://docs.qgis.org/3.16/en/docs/gentle_gis_introduction/coordinate_reference_systems.html
The shapefile format is a geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products.
The shapefile format can spatially describe vector features: points, lines, and polygons.
GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on the JSON format.
The features include points (therefore addresses and locations), line strings (therefore streets, highways and boundaries), polygons (countries, provinces, tracts of land), and multi-part collections of these types. GeoJSON features need not represent entities of the physical world only; mobile routing and navigation apps, for example, might describe their service coverage using GeoJSON.
The GeoJSON format differs from other GIS standards in that it was written and is maintained not by a formal standards organization, but by an Internet working group of developers.
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [102.0, 0.5]
},
"properties": {
"prop0": "value0"
}
},
{
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [
[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
]
},
"properties": {
"prop0": "value0",
"prop1": 0.0
}
},
{
"type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[
[100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
[100.0, 1.0], [100.0, 0.0]
]
]
},
"properties": {
"prop0": "value0",
"prop1": { "this": "that" }
}
}
]
}
TopoJSON is like GeoJSON, but files are generally smaller. Shared boundaries don't require duplicate representations in the schema.
Geocoding is the process of transforming a description of a location—such as a pair of coordinates, an address, or a name of a place—to a location on the earth's surface.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/geocoding/what-is-geocoding.htm
Example using Geocodio
https://www.geocod.io/docs/?python#single-address
from geocodio import GeocodioClient
client = GeocodioClient('YOUR_KEY')
location = client.geocode('180 Madison Ave, New York, NY')
location
{'input': {'address_components': {'number': '180',
'street': 'Madison',
'suffix': 'Ave',
'formatted_street': 'Madison Ave',
'city': 'New York',
'state': 'NY',
'country': 'US'},
'formatted_address': '180 Madison Ave, New York, NY'},
'results': [{'address_components': {'number': '180',
'street': 'Madison',
'suffix': 'Ave',
'formatted_street': 'Madison Ave',
'city': 'New York',
'county': 'New York County',
'state': 'NY',
'zip': '10016',
'country': 'US'},
'formatted_address': '180 Madison Ave, New York, NY 10016',
'location': {'lat': 40.747465, 'lng': -73.983374},
'accuracy': 1,
'accuracy_type': 'rooftop',
'source': 'City of New York'}]}
import requests
import pandas as pd
import numpy as np
import json
import os
https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-selfac.page
df_fac = pd.read_csv(os.path.join('facilities_20210811csv', 'FacDB_20210811.csv'), dtype={'CENSTRACT': str}, encoding='latin-1')
df_fac.columns.tolist()
['OBJECTID', 'FACNAME', 'ADDRESSNUM', 'STREETNAME', 'ADDRESS', 'CITY', 'ZIPCODE', 'BORO', 'BOROCODE', 'BIN', 'BBL', 'CD', 'NTA', 'COUNCIL', 'SCHOOLDIST', 'POLICEPRCT', 'CENSTRACT', 'FACTYPE', 'FACSUBGRP', 'FACGROUP', 'FACDOMAIN', 'SERVAREA', 'OPNAME', 'OPABBREV', 'OPTYPE', 'OVERAGENCY', 'OVERABBREV', 'OVERLEVEL', 'CAPACITY', 'CAPTYPE', 'LATITUDE', 'LONGITUDE', 'XCOORD', 'YCOORD', 'DATASOURCE', 'UID']
df_fac.shape
(33620, 36)
df_fac.head()
| OBJECTID | FACNAME | ADDRESSNUM | STREETNAME | ADDRESS | CITY | ZIPCODE | BORO | BOROCODE | BIN | ... | OVERABBREV | OVERLEVEL | CAPACITY | CAPTYPE | LATITUDE | LONGITUDE | XCOORD | YCOORD | DATASOURCE | UID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | METRO TECH CAREER INSTITUTE INC | 312 | WEST 36 STREET | 312 WEST 36 STREET | NEW YORK | 10018.0 | MANHATTAN | 1 | 1080495 | ... | NYCDOE | City | 0 | NaN | 40.753564 | -73.993278 | 9.861123e+05 | 213820.362752 | nysed_activeinstitutions | 001ccf118b33a4938bc23d091bf75b4b |
| 1 | 2 | CAMBA INC | 19 | WINTHROP STREET | 19 WINTHROP STREET | BROOKLYN | 11225.0 | BROOKLYN | 3 | 3115690 | ... | NYCDOHMH | City | 0 | NaN | 40.656712 | -73.959488 | 9.954905e+05 | 178536.859900 | moeo_socialservicesitelocations | 00233a82fe4e4f419cb8ba13945ae120 |
| 2 | 3 | THE CHILDRENS AID SOCIETY | 340 | MORRIS AVENUE | 340 MORRIS AVENUE | BRONX | 10451.0 | BRONX | 2 | 2091127 | ... | NYCDYCD | City | 0 | NaN | 40.813682 | -73.924737 | 1.005083e+06 | 235732.500190 | moeo_socialservicesitelocations | 002aa1ec95bc76f71581aeb3acbe3f1c |
| 3 | 4 | PROJECT RENEWAL INC | 8 | EAST 3 STREET | 8 EAST 3 STREET | NEW YORK | 10003.0 | MANHATTAN | 1 | 1006546 | ... | NYCDOHMH | City | 0 | NaN | 40.725864 | -73.990997 | 9.867455e+05 | 203728.386639 | moeo_socialservicesitelocations | 002cd093804f8431896cf73aaa022fb4 |
| 4 | 5 | UNITED ACTIVITIES UNLIMITED UAU | 15 | WILLIAM STREET | 15 WILLIAM STREET | STATEN ISLAND | 10304.0 | STATEN ISLAND | 5 | 5013400 | ... | NYCDYCD | City | 0 | NaN | 40.631112 | -74.077172 | 9.628295e+05 | 169216.938105 | moeo_socialservicesitelocations | 002fc9ecd0fb02985331d7cc7c517c3f |
5 rows × 36 columns
df_fac['FACTYPE'].value_counts()
DAY CARE 1777
SUMMER ONLY FEEDING SITE 1763
COMMERCIAL GARAGE 1408
CITY COUNCIL AWARDS 1242
STATE HISTORIC PLACE 1002
...
NON-MEDICAID CARE COORDINATION; (NON-LICENSED PROGRAM)-OMH, NON-MEDICAID CARE COORDINATION (OMH)-OMH 1
NYCHA COMMUNITY CENTER - HEAD START / JOBS PLUS / COMMUNITY CENTER 1
MEDICALLY SUPERVISED OUTPATIENT-OASAS, COMPULSIVE GAMBLING EDUCATION-OASAS 1
NYCHA COMMUNITY CENTER - EDUCATION /LITERACY CENTER 1
NYCHA COMMUNITY CENTER - COMMUNITY CENTER / SPORTS 1
Name: FACTYPE, Length: 604, dtype: int64
df_fac = df_fac[df_fac['FACTYPE']=='HOSPITAL']
df_fac.shape
(61, 36)
df_fac['OPNAME'].value_counts()
NYC Health and Hospitals Corporation 12 The New York and Presbyterian Hospital 5 Montefiore Medical Center 4 NYU Langone Hospitals 4 The Brookdale Hospital Medical Center 3 Staten Island University Hospital 2 St Lukes Roosevelt Hospital Center Inc 2 Memorial Hospital for Cancer and Allied Diseases Inc 2 Beth Israel Medical Center Inc 2 Lenox Hill Hospital 2 The Mount Sinai Hospital 2 Calvary Hospital Inc. 2 BronxCare Health System 2 NYS Department of Health 1 Jamaica Hospital Inc 1 Bridge Regional Health System 1 Wyckoff Heights Medical Center 1 NewYork-Presbyterian/Brooklyn Methodist 1 The Brooklyn Hospital Center 1 NewYork-Presbyterian/Queens 1 Flushing Hospital and Medical Center Inc 1 Long Island Jewish Medical Center 1 NY Eye and Ear Infirmary Inc 1 NY Society for the Relief of The Ruptured & Crippled 1 St Barnabas Hospital Inc 1 New York Community Hospital of Brooklyn,Inc 1 Richmond University Medical Center 1 The Rockefeller University Inc 1 Episcopal Health Services Inc 1 Maimonides Medical Center 1 Name: OPNAME, dtype: int64
year = '2019'
dsource = 'acs'
dname = 'acs5'
cols = 'NAME,B19013_001E' # MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS)
state = '36' # New York State
county = '005,047,061,081,085' # the five boroughs
tract = '*' # all tracts
base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'
data_url = f'{base_url}?get={cols}&for=tract:{tract}&in=state:{state}&in=county:{county}'
data_url
'https://api.census.gov/data/2019/acs/acs5?get=NAME,B19013_001E&for=tract:*&in=state:36&in=county:005,047,061,081,085'
response=requests.get(data_url)
data = response.json()
df_ct = pd.DataFrame(data[1:], columns=data[0])
df_ct.head()
| NAME | B19013_001E | state | county | tract | |
|---|---|---|---|---|---|
| 0 | Census Tract 361, Queens County, New York | 47419 | 36 | 081 | 036100 |
| 1 | Census Tract 363, Queens County, New York | 57802 | 36 | 081 | 036300 |
| 2 | Census Tract 371, Queens County, New York | 67419 | 36 | 081 | 037100 |
| 3 | Census Tract 377, Queens County, New York | 71728 | 36 | 081 | 037700 |
| 4 | Census Tract 379, Queens County, New York | 72799 | 36 | 081 | 037900 |
df_ct.shape
(2167, 5)
I don't want to have to type 'B19013_001E' each time I call this column
df_ct.rename(columns={'B19013_001E': 'income'}, inplace=True)
df_ct['income'] = df_ct['income'].astype('int')
Quick descriptive statistics. Something looks weird...
df_ct['income'].describe()
count 2.167000e+03 mean -2.208168e+07 std 1.195239e+08 min -6.666667e+08 25% 4.541350e+04 50% 6.494600e+04 75% 8.605150e+04 max 2.500010e+05 Name: income, dtype: float64
df_ct[df_ct['income']<0]
| NAME | income | state | county | tract | |
|---|---|---|---|---|---|
| 50 | Census Tract 154, Richmond County, New York | -666666666 | 36 | 085 | 015400 |
| 70 | Census Tract 319, Bronx County, New York | -666666666 | 36 | 005 | 031900 |
| 81 | Census Tract 793, Queens County, New York | -666666666 | 36 | 081 | 079300 |
| 153 | Census Tract 641.02, Queens County, New York | -666666666 | 36 | 081 | 064102 |
| 193 | Census Tract 50, Queens County, New York | -666666666 | 36 | 081 | 005000 |
| ... | ... | ... | ... | ... | ... |
| 2047 | Census Tract 107.01, Queens County, New York | -666666666 | 36 | 081 | 010701 |
| 2056 | Census Tract 171, Queens County, New York | -666666666 | 36 | 081 | 017100 |
| 2064 | Census Tract 852, Kings County, New York | -666666666 | 36 | 047 | 085200 |
| 2095 | Census Tract 716, Queens County, New York | -666666666 | 36 | 081 | 071600 |
| 2138 | Census Tract 246, Queens County, New York | -666666666 | 36 | 081 | 024600 |
72 rows × 5 columns
df_ct['income'][df_ct['income']<0].unique()
array([-666666666])
https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html
-666666666: The estimate could not be computed because there were an insufficient number of sample observations.
We'll replace it with NaN
df_ct['income'].replace(-666666666, np.nan, inplace=True)
df_ct['income'].describe()
count 2095.000000 mean 71126.189976 std 34767.467970 min 9740.000000 25% 47705.000000 50% 66534.000000 75% 86642.000000 max 250001.000000 Name: income, dtype: float64
We cannot use 'tract' alone as a unique key. It's not unique across counties. We can combine it with 'county' for a unique key.
df_ct.sort_values('tract')
| NAME | income | state | county | tract | |
|---|---|---|---|---|---|
| 1959 | Census Tract 1, Queens County, New York | 150421.0 | 36 | 081 | 000100 |
| 336 | Census Tract 1, Bronx County, New York | NaN | 36 | 005 | 000100 |
| 849 | Census Tract 1, New York County, New York | NaN | 36 | 061 | 000100 |
| 2065 | Census Tract 1, Kings County, New York | 96250.0 | 36 | 047 | 000100 |
| 959 | Census Tract 2, Bronx County, New York | 51100.0 | 36 | 005 | 000200 |
| ... | ... | ... | ... | ... | ... |
| 1435 | Census Tract 1617, Queens County, New York | 117813.0 | 36 | 081 | 161700 |
| 346 | Census Tract 1621, Queens County, New York | 76655.0 | 36 | 081 | 162100 |
| 1049 | Census Tract 9901, Kings County, New York | NaN | 36 | 047 | 990100 |
| 939 | Census Tract 9901, Richmond County, New York | NaN | 36 | 085 | 990100 |
| 1147 | Census Tract 9901, Queens County, New York | NaN | 36 | 081 | 990100 |
2167 rows × 5 columns
df_ct['county_ct'] = df_ct['county'] + df_ct['tract']
This is a really good resource on Census GEOIDS:
https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html
Facilities file gives borough code but not county FIPS code. It's easy to re-map.
Borough codes are here:
https://en.wikipedia.org/wiki/Borough,_Block_and_Lot
And I can see the county FIPS code correspondences in the data table above.
boro_map = {'1': '061', '2': '005', '3': '047', '4': '081', '5': '085'}
df_fac['county_code'] = df_fac['BOROCODE'].astype(str).map(boro_map)
df_fac['county_ct'] = df_fac['county_code'] + df_fac['CENSTRACT']
Our first join. This will give us the hospitals merged with the Census Tract-level income data.
df_hosp_income = df_fac.merge(df_ct, how='left')
df_hosp_income.columns.tolist()
['OBJECTID', 'FACNAME', 'ADDRESSNUM', 'STREETNAME', 'ADDRESS', 'CITY', 'ZIPCODE', 'BORO', 'BOROCODE', 'BIN', 'BBL', 'CD', 'NTA', 'COUNCIL', 'SCHOOLDIST', 'POLICEPRCT', 'CENSTRACT', 'FACTYPE', 'FACSUBGRP', 'FACGROUP', 'FACDOMAIN', 'SERVAREA', 'OPNAME', 'OPABBREV', 'OPTYPE', 'OVERAGENCY', 'OVERABBREV', 'OVERLEVEL', 'CAPACITY', 'CAPTYPE', 'LATITUDE', 'LONGITUDE', 'XCOORD', 'YCOORD', 'DATASOURCE', 'UID', 'county_code', 'county_ct', 'NAME', 'income', 'state', 'county', 'tract']
More columns than we need, so let's subset.
df_hosp_income = df_hosp_income[['FACNAME', 'OPNAME', 'LATITUDE', 'LONGITUDE', 'income', 'county_ct']]
I did have some null values. I'm just going to ignore that. This probably woudn't be a great analysis IRL.
df_hosp_income[df_hosp_income['income'].isnull()]
| FACNAME | OPNAME | LATITUDE | LONGITUDE | income | county_ct | |
|---|---|---|---|---|---|---|
| 11 | RUMC-BAYLEY SETON | Richmond University Medical Center | 40.622738 | -74.075356 | NaN | 085002700 |
| 44 | NEW YORK-PRESBYTERIAN HOSPITAL - ALLEN HOSPITAL | The New York and Presbyterian Hospital | 40.873338 | -73.913024 | NaN | 061029700 |
| 57 | CALVARY HOSPITAL INC | Calvary Hospital Inc. | 40.847930 | -73.843916 | NaN | 005028400 |
Split the dataset based on whether they're H+H hospitals or not.
df_pub_hosp = df_hosp_income[df_hosp_income['OPNAME']=='NYC Health and Hospitals Corporation']
df_priv_hosp = df_hosp_income[df_hosp_income['OPNAME']!='NYC Health and Hospitals Corporation']
df_pub_hosp['income'].mean()
57697.5
df_priv_hosp['income'].mean()
83337.69565217392
Not surprisingly, average income is higher in the neighborhoods (defined by Census Tract) where private hospitals are located.
I can mostly reuse all the API parameters I established already:
jump up
Except I want to fetch Block Groups this time.
bg = '*'
data_url = f'{base_url}?get={cols}&for=block%20group:{bg}&in=state:{state}&in=county:{county}'
data_url
'https://api.census.gov/data/2019/acs/acs5?get=NAME,B19013_001E&for=block%20group:*&in=state:36&in=county:005,047,061,081,085'
Now I'm going to more or less repeat a bunch of steps from above, to get Block Groups.
response=requests.get(data_url)
data = response.json()
df_cbg = pd.DataFrame(data[1:], columns=data[0])
df_cbg
| NAME | B19013_001E | state | county | tract | block group | |
|---|---|---|---|---|---|---|
| 0 | Block Group 2, Census Tract 71, Queens County,... | 82829 | 36 | 081 | 007100 | 2 |
| 1 | Block Group 4, Census Tract 117, Queens County... | 91667 | 36 | 081 | 011700 | 4 |
| 2 | Block Group 1, Census Tract 117, Queens County... | 115000 | 36 | 081 | 011700 | 1 |
| 3 | Block Group 2, Census Tract 117, Queens County... | 78333 | 36 | 081 | 011700 | 2 |
| 4 | Block Group 3, Census Tract 117, Queens County... | 55000 | 36 | 081 | 011700 | 3 |
| ... | ... | ... | ... | ... | ... | ... |
| 6488 | Block Group 1, Census Tract 426, Queens County... | -666666666 | 36 | 081 | 042600 | 1 |
| 6489 | Block Group 3, Census Tract 929, Queens County... | 66369 | 36 | 081 | 092900 | 3 |
| 6490 | Block Group 1, Census Tract 28, Queens County,... | 63841 | 36 | 081 | 002800 | 1 |
| 6491 | Block Group 2, Census Tract 28, Queens County,... | 50446 | 36 | 081 | 002800 | 2 |
| 6492 | Block Group 1, Census Tract 320, Queens County... | 73929 | 36 | 081 | 032000 | 1 |
6493 rows × 6 columns
df_cbg.rename(columns={'B19013_001E': 'income'}, inplace=True)
df_cbg['income'] = df_cbg['income'].astype('int')
df_cbg['income'].describe()
count 6.493000e+03 mean -7.611793e+07 std 2.121385e+08 min -6.666667e+08 25% 3.691200e+04 50% 6.263900e+04 75% 8.898800e+04 max 2.500010e+05 Name: income, dtype: float64
df_cbg['income'].replace(-666666666, np.nan, inplace=True)
df_cbg['income'].describe()
count 5751.000000 mean 75279.555034 std 41962.847637 min 8493.000000 25% 46599.000000 50% 68164.000000 75% 93750.000000 max 250001.000000 Name: income, dtype: float64
But! My facilities file didn't have Block Groups as an attribute. So I need to do a spatial join this time (fancy!).
Geopandas makes it easy.
I'm going to time some of these steps.
import geopandas as gpd
from pyproj import CRS
import time
C:\Users\andre\anaconda3\envs\geo_env\lib\site-packages\pyproj\__init__.py:91: UserWarning: Valid PROJ data directory not found. Either set the path using the environmental variable PROJ_LIB or with `pyproj.datadir.set_data_dir`. warnings.warn(str(err))
https://www.census.gov/cgi-bin/geo/shapefiles/index.php
Select year: 2010, layer type: Block Groups, submit
Select state: New York, download state file
cbg_name = 'tl_2010_36_bg10'
st = time.time()
read_name = os.path.join(cbg_name, f'{cbg_name}.shp')
print(read_name)
gdf_cbg_poly = gpd.read_file(read_name)
print(time.time() - st)
tl_2010_36_bg10\tl_2010_36_bg10.shp 2.921198844909668
Geopandas knows how to read a shapefile
gdf_cbg_poly.head()
| STATEFP10 | COUNTYFP10 | TRACTCE10 | BLKGRPCE10 | GEOID10 | NAMELSAD10 | MTFCC10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 007 | 012101 | 1 | 360070121011 | Block Group 1 | G5030 | S | 17263315 | 38069 | +42.2174896 | -075.8790269 | POLYGON ((-75.89054 42.18697, -75.89920 42.178... |
| 1 | 36 | 007 | 012101 | 4 | 360070121014 | Block Group 4 | G5030 | S | 5586521 | 104976 | +42.1831807 | -075.8760650 | POLYGON ((-75.87509 42.17416, -75.87541 42.174... |
| 2 | 36 | 007 | 012101 | 3 | 360070121013 | Block Group 3 | G5030 | S | 1845677 | 0 | +42.1931188 | -075.8462160 | POLYGON ((-75.83837 42.19740, -75.83813 42.197... |
| 3 | 36 | 007 | 012102 | 4 | 360070121024 | Block Group 4 | G5030 | S | 5082696 | 183811 | +42.1823571 | -075.8436840 | POLYGON ((-75.85079 42.16052, -75.85114 42.160... |
| 4 | 36 | 007 | 012102 | 3 | 360070121023 | Block Group 3 | G5030 | S | 1094202 | 145781 | +42.1677108 | -075.8793775 | POLYGON ((-75.87457 42.17416, -75.87338 42.174... |
gdf_cbg_poly.shape
(15464, 13)
Check the coordinate reference system:
gdf_cbg_poly.crs
<Geographic 2D CRS: EPSG:4269> Name: NAD83 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands. - bounds: (167.65, 14.92, -47.74, 86.46) Datum: North American Datum 1983 - Ellipsoid: GRS 1980 - Prime Meridian: Greenwich
https://epsg.io/4269
United States (USA) - New York., accuracy 1.0 m
Look how simple it is to plot! It ain't pretty, but it's easy.
st = time.time()
gdf_cbg_poly.plot()
print(time.time() - st)
14.393245697021484
Now I have all the Census Block Groups in NYS, and I want to do a spatial join with the hospital facilities.
To do this, I first need to create a GeoDataFrame from the facilities using the latitude/longitude.
df_fac.head()
| OBJECTID | FACNAME | ADDRESSNUM | STREETNAME | ADDRESS | CITY | ZIPCODE | BORO | BOROCODE | BIN | ... | CAPACITY | CAPTYPE | LATITUDE | LONGITUDE | XCOORD | YCOORD | DATASOURCE | UID | county_code | county_ct | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 75 | MONTEFIORE MEDICAL CENTER - MONTEFIORE WESTCHE... | 2475 | ST RAYMOND AVENUE | 2475 ST RAYMOND AVENUE | BRONX | 10461.0 | BRONX | 2 | 2042138 | ... | 0 | NaN | 40.840678 | -73.848402 | 1.026197e+06 | 245595.392737 | nysdoh_healthfacilities | 022083b2659c412600184c84c6af2601 | 005 | 005020400 |
| 143 | 144 | NEW YORK EYE AND EAR INFIRMARY OF MOUNT SINAI | 310 | EAST 14 STREET | 310 EAST 14 STREET | NEW YORK | 10003.0 | MANHATTAN | 1 | 1006511 | ... | 0 | NaN | 40.731875 | -73.984599 | 9.885183e+05 | 205918.769890 | nysdoh_healthfacilities | 048b821f461772caf39396bece7f1510 | 061 | 061004000 |
| 325 | 326 | STATEN ISLAND UNIVERSITY HOSP-SOUTH | 375 | SEGUINE AVENUE | 375 SEGUINE AVENUE | STATEN ISLAND | 10309.0 | STATEN ISLAND | 5 | 5082757 | ... | 0 | NaN | 40.516842 | -74.196283 | 9.296750e+05 | 127637.165484 | nysdoh_healthfacilities | 098ab7c2096eac1c31fc33dde32aa91e | 085 | 085019800 |
| 1011 | 1012 | ST JOHNS EPISCOPAL HOSPITAL SO SHORE | 327 | BEACH 19 STREET | 327 BEACH 19 STREET | FAR ROCKAWAY | 11691.0 | QUEENS | 4 | 4430537 | ... | 0 | NaN | 40.598573 | -73.753467 | 1.052713e+06 | 157448.989848 | nysdoh_healthfacilities | 1c3397784de33a4b33a57ca7c8bde383 | 081 | 081099801 |
| 1091 | 1092 | WOODHULL MEDICAL & MENTAL HEALTH CENTER | 760 | BROADWAY | 760 BROADWAY | BROOKLYN | 11206.0 | BROOKLYN | 3 | 3048341 | ... | 0 | NaN | 40.699576 | -73.942388 | 1.000225e+06 | 194156.201122 | nysdoh_healthfacilities | 1e706a2cf449424e2aaed6421b75ffe3 | 047 | 047028501 |
5 rows × 38 columns
This is the syntax to create a GeoDataFrame using Geopandas.
It's important to set the CRS to the same as was used for the Census Block Groups:
gdf_fac_points = gpd.GeoDataFrame(df_fac, geometry=gpd.points_from_xy(df_fac['LONGITUDE'], df_fac['LATITUDE']), crs='EPSG:4269')
gdf_fac_points.head()
| OBJECTID | FACNAME | ADDRESSNUM | STREETNAME | ADDRESS | CITY | ZIPCODE | BORO | BOROCODE | BIN | ... | CAPTYPE | LATITUDE | LONGITUDE | XCOORD | YCOORD | DATASOURCE | UID | county_code | county_ct | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 75 | MONTEFIORE MEDICAL CENTER - MONTEFIORE WESTCHE... | 2475 | ST RAYMOND AVENUE | 2475 ST RAYMOND AVENUE | BRONX | 10461.0 | BRONX | 2 | 2042138 | ... | NaN | 40.840678 | -73.848402 | 1.026197e+06 | 245595.392737 | nysdoh_healthfacilities | 022083b2659c412600184c84c6af2601 | 005 | 005020400 | POINT (-73.84840 40.84068) |
| 143 | 144 | NEW YORK EYE AND EAR INFIRMARY OF MOUNT SINAI | 310 | EAST 14 STREET | 310 EAST 14 STREET | NEW YORK | 10003.0 | MANHATTAN | 1 | 1006511 | ... | NaN | 40.731875 | -73.984599 | 9.885183e+05 | 205918.769890 | nysdoh_healthfacilities | 048b821f461772caf39396bece7f1510 | 061 | 061004000 | POINT (-73.98460 40.73188) |
| 325 | 326 | STATEN ISLAND UNIVERSITY HOSP-SOUTH | 375 | SEGUINE AVENUE | 375 SEGUINE AVENUE | STATEN ISLAND | 10309.0 | STATEN ISLAND | 5 | 5082757 | ... | NaN | 40.516842 | -74.196283 | 9.296750e+05 | 127637.165484 | nysdoh_healthfacilities | 098ab7c2096eac1c31fc33dde32aa91e | 085 | 085019800 | POINT (-74.19628 40.51684) |
| 1011 | 1012 | ST JOHNS EPISCOPAL HOSPITAL SO SHORE | 327 | BEACH 19 STREET | 327 BEACH 19 STREET | FAR ROCKAWAY | 11691.0 | QUEENS | 4 | 4430537 | ... | NaN | 40.598573 | -73.753467 | 1.052713e+06 | 157448.989848 | nysdoh_healthfacilities | 1c3397784de33a4b33a57ca7c8bde383 | 081 | 081099801 | POINT (-73.75347 40.59857) |
| 1091 | 1092 | WOODHULL MEDICAL & MENTAL HEALTH CENTER | 760 | BROADWAY | 760 BROADWAY | BROOKLYN | 11206.0 | BROOKLYN | 3 | 3048341 | ... | NaN | 40.699576 | -73.942388 | 1.000225e+06 | 194156.201122 | nysdoh_healthfacilities | 1e706a2cf449424e2aaed6421b75ffe3 | 047 | 047028501 | POINT (-73.94239 40.69958) |
5 rows × 39 columns
gdf_fac_points.shape
(61, 39)
Plotting these 61 points is much quicker.
You can just barely make out the shape of NYC.
st = time.time()
gdf_fac_points.plot()
print(time.time() - st)
0.15159368515014648
Double check the CRS's match
gdf_cbg_poly.crs == gdf_fac_points.crs
True
gdf_cbg_poly.columns.tolist()
['STATEFP10', 'COUNTYFP10', 'TRACTCE10', 'BLKGRPCE10', 'GEOID10', 'NAMELSAD10', 'MTFCC10', 'FUNCSTAT10', 'ALAND10', 'AWATER10', 'INTPTLAT10', 'INTPTLON10', 'geometry']
Now, everything is set up for the spatial join.
I'm doing a left join onto the hospital points, so I will lop off all the extraneous block groups and be left with only 61.
st = time.time()
gdf_join = gpd.sjoin(gdf_fac_points, gdf_cbg_poly[['COUNTYFP10', 'TRACTCE10', 'BLKGRPCE10', 'geometry']], how='left')
print(time.time() - st)
2.7657508850097656
gdf_join.shape
(61, 43)
gdf_join.head()
| OBJECTID | FACNAME | ADDRESSNUM | STREETNAME | ADDRESS | CITY | ZIPCODE | BORO | BOROCODE | BIN | ... | YCOORD | DATASOURCE | UID | county_code | county_ct | geometry | index_right | COUNTYFP10 | TRACTCE10 | BLKGRPCE10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 75 | MONTEFIORE MEDICAL CENTER - MONTEFIORE WESTCHE... | 2475 | ST RAYMOND AVENUE | 2475 ST RAYMOND AVENUE | BRONX | 10461.0 | BRONX | 2 | 2042138 | ... | 245595.392737 | nysdoh_healthfacilities | 022083b2659c412600184c84c6af2601 | 005 | 005020400 | POINT (-73.84840 40.84068) | 14079 | 005 | 020400 | 2 |
| 143 | 144 | NEW YORK EYE AND EAR INFIRMARY OF MOUNT SINAI | 310 | EAST 14 STREET | 310 EAST 14 STREET | NEW YORK | 10003.0 | MANHATTAN | 1 | 1006511 | ... | 205918.769890 | nysdoh_healthfacilities | 048b821f461772caf39396bece7f1510 | 061 | 061004000 | POINT (-73.98460 40.73188) | 3099 | 061 | 004000 | 6 |
| 325 | 326 | STATEN ISLAND UNIVERSITY HOSP-SOUTH | 375 | SEGUINE AVENUE | 375 SEGUINE AVENUE | STATEN ISLAND | 10309.0 | STATEN ISLAND | 5 | 5082757 | ... | 127637.165484 | nysdoh_healthfacilities | 098ab7c2096eac1c31fc33dde32aa91e | 085 | 085019800 | POINT (-74.19628 40.51684) | 1197 | 085 | 019800 | 4 |
| 1011 | 1012 | ST JOHNS EPISCOPAL HOSPITAL SO SHORE | 327 | BEACH 19 STREET | 327 BEACH 19 STREET | FAR ROCKAWAY | 11691.0 | QUEENS | 4 | 4430537 | ... | 157448.989848 | nysdoh_healthfacilities | 1c3397784de33a4b33a57ca7c8bde383 | 081 | 081099801 | POINT (-73.75347 40.59857) | 10468 | 081 | 099801 | 3 |
| 1091 | 1092 | WOODHULL MEDICAL & MENTAL HEALTH CENTER | 760 | BROADWAY | 760 BROADWAY | BROOKLYN | 11206.0 | BROOKLYN | 3 | 3048341 | ... | 194156.201122 | nysdoh_healthfacilities | 1e706a2cf449424e2aaed6421b75ffe3 | 047 | 047028501 | POINT (-73.94239 40.69958) | 4167 | 047 | 028501 | 1 |
5 rows × 43 columns
gdf_join.columns.tolist()
['OBJECTID', 'FACNAME', 'ADDRESSNUM', 'STREETNAME', 'ADDRESS', 'CITY', 'ZIPCODE', 'BORO', 'BOROCODE', 'BIN', 'BBL', 'CD', 'NTA', 'COUNCIL', 'SCHOOLDIST', 'POLICEPRCT', 'CENSTRACT', 'FACTYPE', 'FACSUBGRP', 'FACGROUP', 'FACDOMAIN', 'SERVAREA', 'OPNAME', 'OPABBREV', 'OPTYPE', 'OVERAGENCY', 'OVERABBREV', 'OVERLEVEL', 'CAPACITY', 'CAPTYPE', 'LATITUDE', 'LONGITUDE', 'XCOORD', 'YCOORD', 'DATASOURCE', 'UID', 'county_code', 'county_ct', 'geometry', 'index_right', 'COUNTYFP10', 'TRACTCE10', 'BLKGRPCE10']
But wait! I still don't have the income data attached.
I fetched that from the Census API a while back, but haven't yet attached it. Time to do that.
df_cbg.columns.tolist()
['NAME', 'income', 'state', 'county', 'tract', 'block group']
This time, I don't need to worry about converting borough codes to county codes. The shapefile had the county codes included.
df_hosp_income_cbg = gdf_join.merge(df_cbg, how='left', left_on=['COUNTYFP10', 'TRACTCE10', 'BLKGRPCE10'], right_on=['county', 'tract', 'block group'])
df_hosp_income_cbg.columns.tolist()
['OBJECTID', 'FACNAME', 'ADDRESSNUM', 'STREETNAME', 'ADDRESS', 'CITY', 'ZIPCODE', 'BORO', 'BOROCODE', 'BIN', 'BBL', 'CD', 'NTA', 'COUNCIL', 'SCHOOLDIST', 'POLICEPRCT', 'CENSTRACT', 'FACTYPE', 'FACSUBGRP', 'FACGROUP', 'FACDOMAIN', 'SERVAREA', 'OPNAME', 'OPABBREV', 'OPTYPE', 'OVERAGENCY', 'OVERABBREV', 'OVERLEVEL', 'CAPACITY', 'CAPTYPE', 'LATITUDE', 'LONGITUDE', 'XCOORD', 'YCOORD', 'DATASOURCE', 'UID', 'county_code', 'county_ct', 'geometry', 'index_right', 'COUNTYFP10', 'TRACTCE10', 'BLKGRPCE10', 'NAME', 'income', 'state', 'county', 'tract', 'block group']
Too many columns.
df_hosp_income_cbg = df_hosp_income_cbg[['FACNAME', 'OPNAME', 'LATITUDE', 'LONGITUDE', 'income', 'county_ct']]
df_hosp_income_cbg['income'].isnull().sum()
13
Again, nulls. Again, ignored.
df_pub_hosp_cbg = df_hosp_income_cbg[df_hosp_income_cbg['OPNAME']=='NYC Health and Hospitals Corporation']
df_priv_hosp_cbg = df_hosp_income_cbg[df_hosp_income_cbg['OPNAME']!='NYC Health and Hospitals Corporation']
df_pub_hosp_cbg['income'].isnull().sum()
3
df_pub_hosp_cbg['income'].isnull().sum() / len(df_pub_hosp_cbg)
0.25
df_priv_hosp_cbg['income'].isnull().sum()
10
df_priv_hosp_cbg['income'].isnull().sum() / len(df_priv_hosp_cbg)
0.20408163265306123
Well, at least the null values seem to be roughly nondifferentially distributed among public & private hospitals.
df_pub_hosp_cbg['income'].mean()
61245.77777777778
df_priv_hosp_cbg['income'].mean()
89221.43589743589
Again, private hospitals seem to be in wealthier neighborhoods (by income) than private hospitals. And both are a bit higher when we zero into Block Groups compared to Census Tracts.
If you were really doing this analysis, you'd want to pay attention to margins of error and such.
Interestingly, the word "choropleth" has nothing to do with color.
"Choro" comes from ancient Greek for "region" or "location," whereas "chloro" comes from ancient Greek for "green" (as in "chlorophyll").
And "pleth" is "many," as in "plethora."
But sometimes, choropleth maps are green.
People who identify as Anglican as a fraction of total persons, in Australia, according to the 2011 census results.
By Toby Hudson based on data from the Australian Bureau of Statistics, CC BY-SA 3.0 au, Link
I'm going to re-do my merge, this time adding income data onto all the CBG's in NYS (not only those with hospitals in NYC).
gdf_nyc_income = gdf_cbg_poly.merge(df_cbg, left_on=['COUNTYFP10', 'TRACTCE10', 'BLKGRPCE10'], right_on=['county', 'tract', 'block group'])
Plotly makes really nice interactive graphics
import plotly.express as px
import plotly.graph_objects as go
Set up some layout parameters
z = 10.5
w = 950
h = 1000
lat = 40.7128
lon = -74.0060
lon = -73.95
Here's the choropleth map. That's it!
fig = px.choropleth_mapbox(gdf_nyc_income,
geojson=gdf_nyc_income['geometry'],
locations=gdf_nyc_income.index,
color='income',
color_continuous_scale='Viridis',
hover_name='NAME',
title='Income by Block Group, NYC',
center = {"lat": lat, "lon": lon},
mapbox_style='open-street-map',
zoom=z,
width=w,
height=h
)
fig.show()
Let's add the hospitals.
First, let's create a new variable to bifurcate public & private.
gdf_fac_points['pub_priv'] = np.where(gdf_fac_points['OPNAME']=='NYC Health and Hospitals Corporation', 'PUBLIC', 'PRIVATE')
Set a tasty color scale
color_scale = {'PUBLIC': 'plum', 'PRIVATE': 'tomato'}
And add the points to the plot
fig.add_scattermapbox(lat=gdf_fac_points['geometry'].y,
lon=gdf_fac_points['geometry'].x,
mode='markers',
marker=go.scattermapbox.Marker(
size=12,
color=gdf_fac_points['pub_priv'].map(color_scale),
opacity=0.8
),
showlegend=False,
text=gdf_fac_points['FACNAME'] + '<br><b>' + gdf_fac_points['pub_priv'] + '</b>',
)
My sincere hope is you come away from this session with the feeling that this stuff is fun.